Exploring The Simpsons Show
Introduction
The Simpsons holds the title of the world’s longest-running animated sitcom. It was created by the American writer Matt Groening in 1989 and is now in its 31st season. The show gives a satirical depiction of the working-class life of the Simpson family, which consists of Homer, Marge, Bart, Lisa, and baby Maggie.
Several resources about the show are available on the data science platform Kaggle and in the #tidytuesday GitHub repository. Let us explore these data sets and see which insights we can get from them. We start by loading the packages required for the analysis.
library(tidyverse)
library(tidytext)
library(topicmodels)
library(scales)
library(readr)
library(kableExtra)
library(treemapify)
library(gridExtra)
library(igraph)
library(ggpubr)
library(ggraph)
library(ggwordcloud)
library(ggcorrplot)
library(GGally)
library(udpipe)
theme_set(theme_bw())
show_table <- function(x, caption = "", head = 50, scroll = FALSE, full.width = FALSE,
digits = 2, col.names = NA, align = NULL){
table <- x %>%
head(head) %>%
kable(caption = caption, digits = digits, col.names = col.names, align = align,
format.args = list(decimal.mark = ".", big.mark = ",")) %>%
kable_styling("striped", position = "left", full_width = full.width)
if(scroll){
table <- table %>%
scroll_box(width = "100%", height = "500px")
}
return(table)
}
firstup <- function(x) {
substr(x, 1, 1) <- toupper(substr(x, 1, 1))
x
}
colors <- c('#0094c7', '#f14e28', "#62AF67ff", "#989DCDff")
palette <- c("#acba81ff", "#2a9430ff", "#ae6b1bff","#024CF0", "#28536bff", "#68aedeff", "#8928e8ff","#f25d30ff",
"#d63d2aff", "orchid", "#b49ba0ff", "darkorange", "#ef4d8bff", "#ffc510ff", "lightsalmon1", "azure4",
"aquamarine3")
palette2 <- c("#6891AB", "#81d2c7ff", "#63D676", "#ffe579ff", "#ffaa60ff", "#ffc09fff", "#f994b6ff", "#b892ffff")
The five data sets we will be working with are simpsons_characters.csv, simpsons_locations.csv, simpsons_script_lines.csv, simpsons_episodes.csv, and simpsons-guests.csv.
characters <- read_delim("Data/simpsons_characters.csv", delim = ",")
locations <- read_delim("Data/simpsons_locations.csv", delim = ",")
episodes <- read_delim("Data/simpsons_episodes.csv", delim = ",")
dialogues <- read_delim("Data/simpsons_script_lines.csv", delim = ",")
guests <- read_delim("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-27/simpsons-guests.csv",
delim = "|", quote = "")
Pre-processing
Before starting off with the actual analysis, it is convenient to perform some pre-processing operations.
characters data set: we recode the levels of gender as Male, Female, and Unknown.
dialogues data set: we retain only the speaking lines and create a row-number identifier for each line.
episodes data set: we create a variable denoting the episode number. Its format has two parts: the first number gives the order in which the episode aired within the entire series, and the second gives the season followed by the episode number within that season. We also keep only the episodes up to season 27, since the information about season 28 is partial (there are only 4 episodes).
guests data set: we exclude the Movie entries and create a logical variable self, which is TRUE if a guest star has played themselves in a particular episode and FALSE if they have voiced a character other than themselves.
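To make the episode-number format concrete, the following minimal sketch builds the identifier for a single episode in base R. The helper name make_episode_number is hypothetical and not part of the original analysis; it only mirrors the sprintf() and unite() steps of the pre-processing code, with an en dash as the separator.

```r
# Hypothetical helper (not in the original analysis) mirroring the pre-processing logic
make_episode_number <- function(id, season, number_in_season) {
  part1 <- sprintf("%03d", id)                                # order within the series, zero-padded to 3 digits
  part2 <- paste0(season, sprintf("%02d", number_in_season))  # season followed by the 2-digit episode number
  paste(part1, part2, sep = "\u2013")                         # the two parts joined with an en dash
}

make_episode_number(10, 1, 10)   # 10th episode overall, season 1, episode 10
make_episode_number(104, 6, 1)   # 104th episode overall, season 6, episode 1
```

The two calls should reproduce the identifiers of Homer’s Night Out (010–110) and Bart of Darkness (104–601) as they appear in the Episodes table.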
characters <- characters %>%
mutate(gender = fct_explicit_na(gender, na_level = "Unknown"),
gender = fct_recode(gender, Male = "m", Female = "f"))
dialogues <- dialogues %>%
filter(speaking_line) %>%
mutate(line_number = row_number()) %>%
select(line_number, episode_id, number, role = raw_character_text,
location = raw_location_text, line = spoken_words)
episodes <- episodes %>%
na.omit() %>%
mutate(part1 = sprintf("%03d", id),
number_in_season = sprintf("%02d", number_in_season)) %>%
unite("part2", c(season, number_in_season), sep = "", remove = FALSE) %>%
unite("number", c(part1, part2), sep = "–", remove = FALSE) %>%
filter(season <= 27) %>%
select(season, number, episode_id = number_in_series, prod_code = production_code,
year = original_air_year, title, rating = imdb_rating,
votes = imdb_votes, views, us_views = us_viewers_in_millions)
guests <- guests %>%
filter(!season %in% "Movie") %>%
mutate(season = parse_number(season)) %>%
separate_rows(role, sep = ";\\s+") %>%
mutate(self = str_detect(role, "self|selves"),
role = ifelse(role == "Edna Krabappel", "Edna Krabappel-Flanders", role)) %>%
rename(title = episode_title, prod_code = production_code)
Let us have a look at the pre-processed data sets characters, locations, episodes, dialogues, and guests.
Characters
| id | name | normalized_name | gender |
|---|---|---|---|
| 7 | Children | children | Unknown |
| 12 | Mechanical Santa | mechanical santa | Unknown |
| 13 | Tattoo Man | tattoo man | Unknown |
| 16 | DOCTOR ZITSOFSKY | doctor zitsofsky | Unknown |
| 20 | Students | students | Unknown |
| 24 | Little Boy | little boy | Unknown |
| 26 | Lewis Clark | lewis clark | Unknown |
| 27 | Little Girl | little girl | Unknown |
| 29 | Bubbles | bubbles | Unknown |
| 30 | Moldy | moldy | Unknown |
| 34 | Ticket Seller | ticket seller | Unknown |
| 35 | Elf #1 | elf 1 | Unknown |
| 36 | Elves | elves | Unknown |
| 37 | Dog’s Owner | dogs owner | Unknown |
| 39 | Kids | kids | Unknown |
| 41 | Conductor | conductor | Unknown |
| 42 | Secretary | secretary | Unknown |
| 46 | Sydney | sydney | Unknown |
| 47 | Cecile Shapiro | cecile shapiro | Unknown |
| 48 | Ian | ian | Unknown |
| 49 | Calvin | calvin | Unknown |
| 50 | Martin Prince, Sr. | martin prince sr | Unknown |
| 51 | Richard | richard | Unknown |
| 53 | Wendell Borton | wendell borton | Unknown |
| 57 | Smilin’ Joe Fission | smilin joe fission | Unknown |
| 58 | Rod #1 | rod 1 | Unknown |
| 59 | Rod #2 | rod 2 | Unknown |
| 60 | RODS | rods | Unknown |
| 61 | Workman #1 | workman 1 | Unknown |
| 62 | Foreman | foreman | Unknown |
| 63 | TERRI & SHERRI | terri sherri | Unknown |
| 64 | PUNK TEENAGER | punk teenager | Unknown |
| 65 | Tv Announcer #1 | tv announcer 1 | Unknown |
| 66 | Tv Announcer #2 | tv announcer 2 | Unknown |
| 67 | Jingle Chorus | jingle chorus | Unknown |
| 68 | Sylvia Winfield | sylvia winfield | Unknown |
| 69 | Old Man Winfield | old man winfield | Unknown |
| 70 | Councilman #1 | councilman 1 | Unknown |
| 72 | Councilman #2 | councilman 2 | Unknown |
| 73 | COUNCILMEN #1/#2 | councilmen 12 | Unknown |
| 74 | Demonstrator #1 | demonstrator 1 | Unknown |
| 75 | Crowd | crowd | Unknown |
| 76 | MR. GAMMILL | mr gammill | Unknown |
| 77 | TOM | tom | Unknown |
| 78 | Mrs. Long | mrs long | Unknown |
| 79 | Wife #1 | wife 1 | Unknown |
| 80 | Wife #2 | wife 2 | Unknown |
| 81 | Other Women | other women | Unknown |
| 82 | Nice Father | nice father | Unknown |
| 83 | Nice Boy | nice boy | Unknown |
Locations
| id | name | normalized_name |
|---|---|---|
| 1 | Street | street |
| 2 | Car | car |
| 3 | Springfield Elementary School | springfield elementary school |
| 4 | Auditorium | auditorium |
| 5 | Simpson Home | simpson home |
| 6 | KITCHEN | kitchen |
| 7 | SHOPPING MALL PARKING LOT | shopping mall parking lot |
| 8 | Springfield Mall | springfield mall |
| 9 | The Happy Sailor Tattoo Parlor | the happy sailor tattoo parlor |
| 10 | Springfield Nuclear Power Plant | springfield nuclear power plant |
| 11 | PLANT | plant |
| 12 | DERMATOLOGY CLINIC | dermatology clinic |
| 13 | Laboratory | laboratory |
| 14 | Circus of Values | circus of values |
| 15 | Moe’s Tavern | moe tavern |
| 16 | Santa School | santa school |
| 17 | Santa’s Workshop | santa workshop |
| 18 | WORKSHOP | workshop |
| 19 | PERSONNEL OFFICE | personnel office |
| 20 | Springfield Downs Dog Track | springfield downs dog track |
| 21 | SPRINGFIELD DOWNS | springfield downs |
| 22 | PADDOCK | paddock |
| 23 | SPRINGFIELD DOWN | springfield down |
| 24 | SPRINGFIELD DOWNS PARKING LOT | springfield downs parking lot |
| 25 | Simpson Living Room | simpson living room |
| 26 | Springfield Elementary School Playground | springfield elementary school playground |
| 27 | CLASSROOM | classroom |
| 28 | Skinner’s Office | skinner office |
| 29 | Homer’s Car | homer car |
| 30 | NEW SCHOOL | new school |
| 31 | Opera House | opera house |
| 32 | OLD SCHOOL | old school |
| 33 | NEW CLASSROOM | new classroom |
| 34 | SCHOOL BUILDING | school building |
| 35 | Simpson Back Porch | simpson back porch |
| 36 | Bus | bus |
| 37 | Road | road |
| 38 | Conference Room | conference room |
| 39 | COFFEE ROOM | coffee room |
| 40 | Bar | bar |
| 41 | Berger’s Burgers | berger burgers |
| 42 | REFRIGERATOR | refrigerator |
| 43 | Bart’s Bedroom | bart bedroom |
| 44 | Simpson Backyard | simpson backyard |
| 45 | Simpson Neighborhood | simpson neighborhood |
| 46 | Master Bedroom | master bedroom |
| 47 | LIVING ROOM | living room |
| 48 | Springfield Town Hall | springfield town hall |
| 49 | CITY COUNCIL CHAMBERS | city council chambers |
| 50 | Park | park |
Episodes
| season | number | episode_id | prod_code | year | title | rating | votes | views | us_views |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 010–110 | 10 | 7G10 | 1,990 | Homer’s Night Out | 7.4 | 1,511 | 50,816 | 30.3 |
| 1 | 012–112 | 12 | 7G12 | 1,990 | Krusty Gets Busted | 8.3 | 1,716 | 62,561 | 30.4 |
| 2 | 014–201 | 14 | 7F03 | 1,990 | Bart Gets an “F” | 8.2 | 1,638 | 59,575 | 33.6 |
| 2 | 017–204 | 17 | 7F01 | 1,990 | Two Cars in Every Garage and Three Eyes on Every Fish | 8.1 | 1,457 | 64,959 | 26.1 |
| 2 | 019–206 | 19 | 7F08 | 1,990 | Dead Putting Society | 8.0 | 1,366 | 50,691 | 25.4 |
| 2 | 021–208 | 21 | 7F06 | 1,990 | Bart the Daredevil | 8.4 | 1,522 | 57,605 | 26.2 |
| 2 | 023–210 | 23 | 7F10 | 1,991 | Bart Gets Hit by a Car | 7.8 | 1,340 | 56,486 | 24.8 |
| 2 | 026–213 | 26 | 7F13 | 1,991 | Homer vs. Lisa and the 8th Commandment | 8.0 | 1,329 | 58,277 | 26.2 |
| 2 | 028–215 | 28 | 7F16 | 1,991 | Oh Brother, Where Art Thou? | 8.2 | 1,413 | 47,426 | 26.8 |
| 2 | 030–217 | 30 | 7F17 | 1,991 | Old Money | 7.6 | 1,243 | 44,331 | 21.2 |
| 2 | 032–219 | 32 | 7F19 | 1,991 | Lisa’s Substitute | 8.5 | 1,684 | 52,770 | 17.7 |
| 2 | 035–222 | 35 | 7F22 | 1,991 | Blood Feud | 8.0 | 1,223 | 52,829 | 17.3 |
| 3 | 037–302 | 37 | 8F01 | 1,991 | Mr. Lisa Goes to Washington | 7.7 | 1,274 | 52,098 | 20.2 |
| 3 | 039–304 | 39 | 8F03 | 1,991 | Bart the Murderer | 8.7 | 1,446 | 64,342 | 20.8 |
| 3 | 041–306 | 41 | 8F05 | 1,991 | Like Father, Like Clown | 7.7 | 1,262 | 45,586 | 20.2 |
| 3 | 044–309 | 44 | 8F07 | 1,991 | Saturdays of Thunder | 7.9 | 1,194 | 55,808 | 24.7 |
| 3 | 046–311 | 46 | 8F09 | 1,991 | Burns Verkaufen der Kraftwerk | 8.2 | 1,291 | 55,987 | 21.1 |
| 3 | 048–313 | 48 | 8F11 | 1,992 | Radio Bart | 8.5 | 1,365 | 58,919 | 24.2 |
| 3 | 051–316 | 51 | 8F16 | 1,992 | Bart the Lover | 8.3 | 1,272 | 53,123 | 20.5 |
| 3 | 053–318 | 53 | 8F15 | 1,992 | Separate Vocations | 8.2 | 1,201 | 61,508 | 23.7 |
| 3 | 055–320 | 55 | 8F19 | 1,992 | Colonel Homer | 7.9 | 1,233 | 46,901 | 25.5 |
| 3 | 058–323 | 58 | 8F22 | 1,992 | Bart’s Friend Falls in Love | 7.8 | 1,160 | 48,058 | 19.5 |
| 4 | 060–401 | 60 | 8F24 | 1,992 | Kamp Krusty | 8.4 | 1,414 | 67,081 | 21.8 |
| 4 | 065–406 | 65 | 9F03 | 1,992 | Itchy & Scratchy: The Movie | 8.2 | 1,293 | 55,740 | 20.1 |
| 4 | 069–410 | 69 | 9F08 | 1,992 | Lisa’s First Word | 8.5 | 1,350 | 62,070 | 28.6 |
| 4 | 072–413 | 72 | 9F11 | 1,993 | Selma’s Choice | 8.0 | 1,153 | 56,396 | 24.5 |
| 1 | 007–107 | 7 | 7G09 | 1,990 | The Call of the Simpsons | 7.9 | 1,638 | 57,793 | 27.6 |
| 2 | 024–211 | 24 | 7F12 | 1,991 | One Fish, Two Fish, Blowfish, Blue Fish | 8.8 | 1,687 | 50,206 | 24.2 |
| 4 | 080–421 | 80 | 9F20 | 1,993 | Marge in Chains | 7.7 | 1,080 | 68,692 | 17.3 |
| 5 | 082–501 | 82 | 9F21 | 1,993 | Homer’s Barbershop Quartet | 8.4 | 1,416 | 58,390 | 19.9 |
| 5 | 084–503 | 84 | 1F02 | 1,993 | Homer Goes to College | 8.6 | 1,476 | 64,802 | 18.1 |
| 5 | 087–506 | 87 | 1F03 | 1,993 | Marge on the Lam | 8.0 | 1,132 | 53,490 | 21.7 |
| 5 | 089–508 | 89 | 1F06 | 1,993 | Boy-Scoutz ’n the Hood | 8.7 | 1,270 | 83,238 | 20.1 |
| 5 | 092–511 | 92 | 1F09 | 1,994 | Homer the Vigilante | 8.2 | 1,202 | 74,673 | 20.1 |
| 5 | 093–512 | 93 | 1F11 | 1,994 | Bart Gets Famous | 8.1 | 1,123 | 66,267 | 20.0 |
| 5 | 095–514 | 95 | 1F12 | 1,994 | Lisa vs. Malibu Stacy | 8.2 | 1,187 | 61,715 | 20.5 |
| 5 | 098–517 | 98 | 1F15 | 1,994 | Bart Gets an Elephant | 7.9 | 1,116 | 63,427 | 17.0 |
| 5 | 102–521 | 102 | 1F21 | 1,994 | Lady Bouvier’s Lover | 7.5 | 1,014 | 59,503 | 15.1 |
| 6 | 104–601 | 104 | 1F22 | 1,994 | Bart of Darkness | 8.6 | 1,330 | 65,126 | 15.1 |
| 6 | 107–604 | 107 | 2F01 | 1,994 | Itchy & Scratchy Land | 8.5 | 1,277 | 72,722 | 14.8 |
| 6 | 111–608 | 111 | 2F05 | 1,994 | Lisa on Ice | 8.4 | 1,236 | 63,564 | 17.9 |
| 6 | 114–611 | 114 | 2F08 | 1,994 | Fear of Flying | 7.8 | 1,100 | 61,569 | 15.6 |
| 6 | 116–613 | 116 | 2F10 | 1,995 | And Maggie Makes Three | 8.5 | 1,284 | 63,051 | 17.3 |
| 6 | 118–615 | 118 | 2F12 | 1,995 | Homie the Clown | 8.5 | 1,254 | 73,123 | 17.6 |
| 6 | 120–617 | 120 | 2F14 | 1,995 | Homer vs. Patty and Selma | 7.9 | 1,006 | 60,599 | 18.9 |
| 6 | 123–620 | 123 | 2F18 | 1,995 | Two Dozen and One Greyhounds | 8.1 | 1,051 | 62,323 | 11.6 |
| 6 | 125–622 | 125 | 2F32 | 1,995 | ’Round Springfield | 8.3 | 1,084 | 56,001 | 12.6 |
| 6 | 127–624 | 127 | 2F22 | 1,995 | Lemon of Troy | 8.6 | 1,285 | 70,698 | 13.1 |
| 7 | 130–702 | 130 | 2F17 | 1,995 | Radioactive Man | 8.3 | 1,172 | 62,390 | 15.7 |
| 7 | 132–704 | 132 | 3F02 | 1,995 | Bart Sells His Soul | 8.7 | 1,354 | 65,333 | 14.8 |
Dialogues
| line_number | episode_id | number | role | location | line |
|---|---|---|---|---|---|
| 1 | 32 | 209 | Miss Hoover | Springfield Elementary School | No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it’s only natural that you think you have it. |
| 2 | 32 | 210 | Lisa Simpson | Springfield Elementary School | Where’s Mr. Bergstrom? |
| 3 | 32 | 211 | Miss Hoover | Springfield Elementary School | I don’t know. Although I’d sure like to talk to him. He didn’t touch my lesson plan. What did he teach you? |
| 4 | 32 | 212 | Lisa Simpson | Springfield Elementary School | That life is worth living. |
| 5 | 32 | 213 | Edna Krabappel-Flanders | Springfield Elementary School | The polls will be open from now until the end of recess. Now, just in case any of you have decided to put any thought into this, we’ll have our final statements. Martin? |
| 6 | 32 | 214 | Martin Prince | Springfield Elementary School | I don’t think there’s anything left to say. |
| 7 | 32 | 215 | Edna Krabappel-Flanders | Springfield Elementary School | Bart? |
| 8 | 32 | 216 | Bart Simpson | Springfield Elementary School | Victory party under the slide! |
| 9 | 32 | 218 | Lisa Simpson | Apartment Building | Mr. Bergstrom! Mr. Bergstrom! |
| 10 | 32 | 219 | Landlady | Apartment Building | Hey, hey, he Moved out this morning. He must have a new job – he took his Copernicus costume. |
| 11 | 32 | 220 | Lisa Simpson | Apartment Building | Do you know where I could find him? |
| 12 | 32 | 221 | Landlady | Apartment Building | I think he’s taking the next train to Capital City. |
| 13 | 32 | 222 | Lisa Simpson | Apartment Building | The train, how like him… traditional, yet environmentally sound. |
| 14 | 32 | 223 | Landlady | Apartment Building | Yes, and it’s been the backbone of our country since Leland Stanford drove that golden spike at Promontory point. |
| 15 | 32 | 224 | Lisa Simpson | Apartment Building | I see he touched you, too. |
| 16 | 32 | 226 | Bart Simpson | Springfield Elementary School | Hey, thanks for your vote, man. |
| 17 | 32 | 227 | Nelson Muntz | Springfield Elementary School | I didn’t vote. Voting’s for geeks. |
| 18 | 32 | 228 | Bart Simpson | Springfield Elementary School | Well, you got that right. Thanks for your vote, girls. |
| 19 | 32 | 229 | Terri/sherri | Springfield Elementary School | We forgot. |
| 20 | 32 | 230 | Bart Simpson | Springfield Elementary School | Well, don’t sweat it. Just so long as a couple of people did… right, Milhouse? |
| 21 | 32 | 231 | Milhouse Van Houten | Springfield Elementary School | Uh oh. |
| 22 | 32 | 232 | Bart Simpson | Springfield Elementary School | Lewis? |
| 23 | 32 | 233 | Bart Simpson | Springfield Elementary School | Somebody must have voted. |
| 24 | 32 | 234 | Milhouse Van Houten | Springfield Elementary School | What about you, Bart? Didn’t you vote? |
| 25 | 32 | 235 | Bart Simpson | Springfield Elementary School | Uh oh. |
| 26 | 32 | 237 | Wendell Borton | Springfield Elementary School | Yayyyyyyyyyyyyyy! |
| 27 | 32 | 238 | Bart Simpson | Springfield Elementary School | I demand a recount. |
| 28 | 32 | 239 | Edna Krabappel-Flanders | Springfield Elementary School | One for Martin, two for Martin. Would you like another recount? |
| 29 | 32 | 240 | Bart Simpson | Springfield Elementary School | No. |
| 30 | 32 | 241 | Edna Krabappel-Flanders | Springfield Elementary School | Well, I just want to make sure. One for Martin. Two for Martin. |
| 31 | 32 | 242 | Kid Reporter | Springfield Elementary School | This way, Mister President! |
| 32 | 32 | 244 | Conductor | Train Station | Now boarding on track 5, The afternoon delight coming to Shelbyville, Parkville, and….. |
| 33 | 32 | 245 | Lisa Simpson | Train Station | Mr. Bergstrom! Hey, Mr. Bergstrom! |
| 34 | 32 | 246 | BERGSTROM | Train Station | Hey, Lisa. |
| 35 | 32 | 247 | Lisa Simpson | Train Station | Hey, Lisa, indeed. |
| 36 | 32 | 248 | BERGSTROM | Train Station | What? What is it? |
| 37 | 32 | 249 | Lisa Simpson | Train Station | Oh, I mean, were you just going to leave, just like that? |
| 38 | 32 | 250 | BERGSTROM | Train Station | Ah, I’m sorry, Lisa. You know, it’s the life of the substitute teacher: he’s a fraud. Today he might be wearing gym shorts, tomorrow he’s speaking French, or, or, or pretending to know how to run a band saw, or God knows what. |
| 39 | 32 | 251 | Lisa Simpson | Train Station | You can’t go! You’re the best teacher I’ll ever have. |
| 40 | 32 | 252 | BERGSTROM | Train Station | Ah, that’s not true. Other teachers will come along who… |
| 41 | 32 | 253 | Lisa Simpson | Train Station | Oh, please. |
| 42 | 32 | 254 | BERGSTROM | Train Station | No, I can’t lie to you, I am the best. But, you know, they need me over in the projects of Capital City. |
| 43 | 32 | 255 | Lisa Simpson | Train Station | But I need you too. |
| 44 | 32 | 256 | BERGSTROM | Train Station | That’s the problem with being middle class. Anybody who really cares will abandon you for those who need it more. |
| 45 | 32 | 257 | Lisa Simpson | Train Station | I, I understand. Mr. Bergstrom, I’m going to miss you. |
| 46 | 32 | 258 | BERGSTROM | Train Station | I’ll tell you what… |
| 47 | 32 | 259 | BERGSTROM | Train Station | Whenever you feel like you’re alone and there’s nobody you can rely on, this is all you need to know. |
| 48 | 32 | 260 | Lisa Simpson | Train Station | Thank you, Mr. Bergstrom. |
| 49 | 32 | 261 | Conductor | Train Station | All aboard! |
| 50 | 32 | 262 | Lisa Simpson | Train Station | So, I guess this is it? It you don’t mind I’ll just run alongside the train as it speeds you from my life? |
Guests
| season | number | prod_code | title | guest_star | role | self |
|---|---|---|---|---|---|---|
| 1 | 002–102 | 7G02 | Bart the Genius | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 1 | 002–102 | 7G02 | Bart the Genius | Marcia Wallace | Ms. Melon | FALSE |
| 1 | 003–103 | 7G03 | Homer’s Odyssey | Sam McMurray | Worker | FALSE |
| 1 | 003–103 | 7G03 | Homer’s Odyssey | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 1 | 006–106 | 7G06 | Moaning Lisa | Miriam Flynn | Ms. Barr | FALSE |
| 1 | 006–106 | 7G06 | Moaning Lisa | Ron Taylor | Bleeding Gums Murphy | FALSE |
| 1 | 007–107 | 7G09 | The Call of the Simpsons | Albert Brooks | Cowboy Bob | FALSE |
| 1 | 008–108 | 7G07 | The Telltale Head | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 1 | 009–109 | 7G11 | Life on the Fast Lane | Albert Brooks | Jacques | FALSE |
| 1 | 010–110 | 7G10 | Homer’s Night Out | Sam McMurray | Gulliver Dark | FALSE |
| 1 | 011–111 | 7G13 | The Crepes of Wrath | Christian Coffinet | Gendarme Officer | FALSE |
| 1 | 012–112 | 7G12 | Krusty Gets Busted | Kelsey Grammer | Sideshow Bob | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | June Foray | Babysitter service receptionist | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | June Foray | Doofy the Elf | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | Penny Marshall | Ms. Botz | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | Penny Marshall | Lucille Botzcowski | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | Paul Willson | Florist | FALSE |
| 2 | 014–201 | 7F03 | Bart Gets an “F” | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 2 | 015–202 | 7F02 | Simpson and Delilah | Harvey Fierstein | Karl | FALSE |
| 2 | 016–203 | 7F04 | Treehouse of Horror | James Earl Jones | Removal man | FALSE |
| 2 | 016–203 | 7F04 | Treehouse of Horror | James Earl Jones | Serak the Preparer | FALSE |
| 2 | 016–203 | 7F04 | Treehouse of Horror | James Earl Jones | Narrator | FALSE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Tony Bennett | Himself | TRUE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Daryl Coley | Bleeding Gums Murphy | FALSE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Ken Levine | Dan Horde | FALSE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Tom Poston | Capital City Goofball | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Rory | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Eddie | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Radio voice | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | “Hooray for Everything” Announcer | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Security Man | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Carol Kane | Maggie Simpson | FALSE |
| 2 | 022–209 | 7F09 | Itchy & Scratchy & Marge | Alex Rocco | Roger Meyers Jr. | FALSE |
| 2 | 023–210 | 7F10 | Bart Gets Hit by a Car | Phil Hartman | Lionel Hutz | FALSE |
| 2 | 023–210 | 7F10 | Bart Gets Hit by a Car | Phil Hartman | Heaven | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Larry King | Himself | TRUE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Joey Miyashima | Toshiro | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Sab Shimono | Master Sushi Chef | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | George Takei | Akira | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Diana Tanaka | Hostess | FALSE |
| 2 | 025–212 | 7F12 | The Way We Was | Jon Lovitz | Artie Ziff | FALSE |
| 2 | 025–212 | 7F12 | The Way We Was | Jon Lovitz | Mr. Seckofsky | FALSE |
| 2 | 026–213 | 7F13 | Homer vs. Lisa and the 8th Commandment | Phil Hartman | Troy McClure | FALSE |
| 2 | 026–213 | 7F13 | Homer vs. Lisa and the 8th Commandment | Phil Hartman | Moses | FALSE |
| 2 | 026–213 | 7F13 | Homer vs. Lisa and the 8th Commandment | Phil Hartman | Cable guy | FALSE |
| 2 | 027–214 | 7F15 | Principal Charming | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 2 | 028–215 | 7F16 | Oh Brother, Where Art Thou? | Danny DeVito | Herbert Powell | FALSE |
| 2 | 029–216 | 7F14 | Bart’s Dog Gets an F | Tracey Ullman | Emily Winthropp | FALSE |
| 2 | 029–216 | 7F14 | Bart’s Dog Gets an F | Tracey Ullman | Sylvia Winfield | FALSE |
| 2 | 029–216 | 7F14 | Bart’s Dog Gets an F | Frank Welker | Santa’s Little Helper | FALSE |
Let us join dialogues with the information contained in characters and episodes. This gives us the gender of each speaking character and the season of each episode. Note that dialogues ends at episode 568 (episode 16 of season 26), whereas episodes ends at episode 596 (the last episode of season 27).
dialogues <- dialogues %>%
left_join(characters, by = c("role" = "name")) %>%
mutate(gender = fct_explicit_na(gender, na_level = "Unknown")) %>%
left_join(episodes[,c("season", "episode_id")], by = "episode_id")
dialogues <- dialogues %>%
mutate(season = ifelse(is.na(season), 8, season)) %>%
select(line_number, season, number = episode_id, location, role, gender, line)
Let us also join episodes with information from guests. This will be useful to get, for instance, the names of the guest stars in each episode (if any) and the roles they played. Note that guests ends at episode 662 (episode 23 of season 30).
episodes <- episodes %>%
left_join(guests, by = c("number", "season", "prod_code", "title")) %>%
mutate(guest = !is.na(role))
Lastly, we tidy dialogues into dialogues.tidy, which puts the speaking lines into a one-word-per-row format.
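The tidying code itself is not shown, so the following is only a minimal sketch of how a one-word-per-row table could be produced, assuming tidytext's unnest_tokens() with its default word tokenizer. The toy data frame reuses two lines from the Dialogues table; the names toy_dialogues and toy_tidy are illustrative, and whether the original analysis applied further steps (such as removing stop words) is not known.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the joined `dialogues` table (illustrative subset of columns)
toy_dialogues <- data.frame(
  line_number = 1:2,
  role = c("Lisa Simpson", "Bart Simpson"),
  line = c("That life is worth living.", "I demand a recount."),
  stringsAsFactors = FALSE
)

# One row per word: by default unnest_tokens() lowercases the text,
# strips punctuation, and drops the original `line` column
toy_tidy <- toy_dialogues %>%
  unnest_tokens(output = word, input = line)

toy_tidy
```

Because every row of the tidy table is one spoken word, counting rows per role or per gender directly yields word counts, which is how the frequency tables and plots in the later sections can be read.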
Characters
The Simpsons is known for its vast ensemble of leading and supporting characters. The characters data set collects the names of 6,722 characters that appeared throughout the seasons. For about 95% of them, the gender is not recorded. However, this is not problematic, since those characters are not particularly relevant to the development of the show: they account for only about 15% of all spoken words.
Gender distribution
frequency_table <- function(df, group_var, align = NULL, prop = TRUE, head = nrow(df), caption = ""){
group_var <- enquo(group_var)
col.names <- c(firstup(as_label(group_var)), "Frequency")
table <- df %>%
group_by(!! group_var) %>%
summarize(n = n()) %>%
arrange(desc(n))
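if(prop){
col.names <- c(col.names, "Proportion")
# Rows are already sorted by n above; format prop as a string last, because
# sorting percent labels alphabetically would misorder them (e.g., "9%" above "10%")
table <- table %>%
mutate(prop = percent(n / sum(n)))
}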
table %>%
show_table(col.names = col.names, align = align, head = head, caption = caption)
}
characters %>%
frequency_table(gender, align = c("l", "r", "r"),
caption = "The gender of the Simpsons characters throughout 26 seasons")
| Gender | Frequency | Proportion |
|---|---|---|
| Unknown | 6,399 | 95.2% |
| Male | 252 | 3.7% |
| Female | 71 | 1.1% |
Contribution to dialogues
dialogues.tidy %>%
frequency_table(gender, align = c("l", "r", "r"),
caption = "How each gender contributes to the dialogues of 26 seasons")
| Gender | Frequency | Proportion |
|---|---|---|
| Male | 842,670 | 63.9% |
| Female | 274,362 | 20.8% |
| Unknown | 202,416 | 15.3% |
The Simpsons is characterized by a marked gender imbalance, which is also reflected in the show’s writing staff. More than 75% of the characters with a recorded gender are male. The only female leading characters are Marge and Lisa Simpson, while the female supporting cast includes Edna Krabappel-Flanders (a teacher at Springfield Elementary School) and the twins Selma and Patty Bouvier (Marge’s older sisters).
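As a quick check of the imbalance described above, the shares among characters with a recorded gender can be recomputed directly from the two frequency tables. This is a small base R sketch; the variable names are illustrative and the counts are copied from the tables.

```r
# Counts copied from the two frequency tables (Unknown excluded)
n_chars <- c(Male = 252, Female = 71)          # characters with a recorded gender
n_words <- c(Male = 842670, Female = 274362)   # words spoken by those characters

# Percentage shares, rounded to one decimal place
share_chars <- round(100 * n_chars / sum(n_chars), 1)
share_words <- round(100 * n_words / sum(n_words), 1)

share_chars
share_words
```

Male characters make up about 78% of the characters with a recorded gender and speak roughly 75% of the words attributed to them, so the imbalance shows up both in the cast and in the dialogue.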
lollipop <- function(df, x, fill.var, count = TRUE, index = NULL, title = "", ylab = "",
sub = NULL, pos = "bottom", labels = TRUE){
x <- enquo(x)
fill.var <- enquo(fill.var)
count_function <- function(df, x, y){
x <- enquo(x)
y <- enquo(y)
df %>%
count(!! x, !! y, sort = TRUE)
}
if(count){
df <- df %>%
count_function(x = !! x, y = !! fill.var)
}
if(!is.null(index)){
df <- df %>%
slice(index)
}
p <- df %>%
ggplot(aes(!! x, n, fill = !! fill.var)) +
geom_segment(aes(
x = fct_reorder(!! x, n),
xend = !! x,
y = 0,
yend = n,
color = !! fill.var),
size = 0.8,
show.legend = FALSE) +
geom_point(aes(
color = !! fill.var),
size = 4,
alpha = 0.6) +
coord_flip() +
labs(x = "", y = ylab, title = title, subtitle = sub) +
scale_fill_manual(name = firstup(as_label(fill.var)), values = rev(colors[1:2])) +
scale_color_manual(name = firstup(as_label(fill.var)), values = rev(colors[1:2])) +
theme(legend.position = pos)
if(labels){
p <- p +
scale_y_continuous(labels = comma)
}
p
}
plot.top.chars <- dialogues.tidy %>%
lollipop(x = role, fill.var = gender, index = c(1:10), title = "Top 10 Simpson leading characters",
ylab = "Number of spoken words in 26 seasons")
plot.supporting.chars <- dialogues.tidy %>%
lollipop(x = role, fill.var = gender, index = c(11:35), title = "The Simpsons supporting cast",
ylab = "Number of spoken words in 26 seasons")
ggarrange(plot.top.chars, plot.supporting.chars, nrow = 1, common.legend = TRUE, legend="bottom",
widths = c(0.95, 1.05))
Let us look at how the total number of words spoken by the top four leading characters (Homer, Marge, Bart, and Lisa Simpson) has evolved throughout the seasons.
Top20Chars <- dialogues.tidy %>%
count(role, sort = TRUE) %>%
distinct(role) %>%
head(20) %>%
pull(role)
dialogues.tidy %>%
filter(role %in% Top20Chars[1:4]) %>%
group_by(season, role) %>%
summarize(nwords = n()) %>%
ungroup() %>%
mutate(role = reorder(role, -nwords)) %>%
ggplot(aes(x = season, y = nwords, fill = role, color = role)) +
scale_y_continuous(labels = comma) +
scale_x_continuous(breaks = seq(1, 26, 5)) +
geom_col(alpha = 0.65) +
facet_wrap(~ role, nrow = 1) +
labs(x = "Season", y = "Number of spoken words per season",
title = "Homer Simpson has dominated all dialogues in every season") +
scale_fill_manual(name = "Character", values = palette[c(8,14,6,1)]) +
scale_color_manual(name = "Character", values = palette[c(8,14,6,1)]) +
theme(legend.position = "bottom")
Locations
The Simpsons is mainly set in Springfield, a fictional town that acts as a universe in which the characters can explore the issues faced by modern society. Although the locations data set reports 4,459 distinct settings, most of the dialogue actually takes place in far fewer places.
Let us have a look at the 20 most common locations, that is, the settings where the characters had the most dialogue. At the top, we find the Simpson home, followed by Springfield Elementary School and Moe’s Tavern. The majority of these locations are indoor settings.
dialogues.tidy %>%
count(location, sort = TRUE) %>%
head(20) %>%
add_column("Space" = c(rep("Indoor", 8), rep("Outdoor", 3), rep("Indoor", 3), "Outdoor",
rep("Indoor", 3), "Outdoor", "Indoor")) %>%
mutate(Space = as.factor(Space)) %>%
ggplot(aes(area = n, label = location, fill = Space, subgroup = location)) +
geom_treemap(alpha = 0.8) +
geom_treemap_subgroup_border(color = "black", size = 0.85) +
geom_treemap_text(place = "centre", size = 13,
grow = FALSE, reflow = TRUE) +
scale_fill_manual(values = colors[3:4]) +
theme(legend.position = "bottom") +
labs(title = "Top 20 locations",
subtitle = "Most dialogues take place at The Simpson home, and indoor.")
It would be interesting to investigate which characters have the most dialogue within each of these locations. Let us consider for simplicity the top six locations.
top.locations <- dialogues.tidy %>%
count(location, word, sort = TRUE) %>%
group_by(location) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
top_n(6, n) %>%
pull(location)
set.seed(2)
dialogues.tidy %>%
filter(location %in% top.locations) %>%
count(role, location, sort = TRUE) %>%
group_by(location) %>%
top_n(5, n) %>%
ungroup() %>%
mutate(location = fct_reorder(location, -n, sum)) %>%
ggplot(aes(reorder_within(role, n, location), n, fill = fct_reorder(role, n, sum))) +
geom_col(show.legend = FALSE) +
coord_flip() +
facet_wrap(~ location, scales = "free") +
scale_fill_manual(values = sample(palette)) +
labs(title = "Who speaks the most where?", y = "Number of spoken words throughout 26 seasons", x = "") +
scale_y_continuous(labels = comma) +
theme(panel.spacing = unit(0.5, "lines")) +
scale_x_reordered()Simpson home: it is the house of the Simpson family. Of course, the most talkative characters here are the Simpson family members, granddad included.
Springfield Elementary School: it is the local school on The Simpsons, attended by Bart and Lisa Simpson. Besides the two of them, the other leading characters include the principal Skinner, the superintendent Chalmers and the teacher Edna.
Moe’s Tavern: it is the local bar in Springfield. The dialogue here mostly occurs between the owner Moe and his guests Homer, Lenny, Carl, and Barney.
Springfield Nuclear Power Plant: it is the nuclear power plant in Springfield. The leading characters here are Mr. Burns, who owns the plant, his executive assistant Smithers, and the employees Homer, Lenny, and Carl.
Kwik-E-Mart: it is the convenience store run by Apu. The dialogues here usually involve the owner of the store and the members of the Simpson family.
First Church of Springfield: it is the main religious house in Springfield. Most of the dialogues here occur between the Reverend, the Simpson family, and their very religious next-door neighbor Ned.
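To make the "top speakers within each location" step concrete, here is a minimal sketch on made-up data. The toy table and its values are hypothetical, and `slice_max()` is used as the current dplyr counterpart of `top_n()`; it is an illustration under those assumptions, not the code behind the plot above.

```r
library(dplyr)

# Hypothetical stand-in for dialogues.tidy: one row per spoken word
toy_words <- tibble(
  location = c(rep("Simpson Home", 6), rep("Moe's Tavern", 5)),
  role     = c("Homer", "Homer", "Homer", "Marge", "Marge", "Bart",
               "Moe", "Moe", "Moe", "Homer", "Barney")
)

# Count words per (location, role), then keep the two most talkative roles per location
top_speakers <- toy_words %>%
  count(location, role, name = "n_words") %>%
  group_by(location) %>%
  slice_max(n_words, n = 2, with_ties = FALSE) %>%
  ungroup()

top_speakers
```

Setting `with_ties = FALSE` guarantees exactly two rows per location, whereas `top_n()` keeps every tied row.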
Ratings and TV views
The episodes data set contains interesting information on the IMDb (Internet Movie Database) rating of each episode on a 1 - 10 scale, the number of votes it received, and the number of TV views in the United States.
Let us have a look at the ratings, the number of votes, and the TV views across episodes, as well as averaged by season.
Episodes
scatter_plot <- function(df, x, y, xlab = "Original air date", ylab, title,
breaks = NULL, limits = NULL, labels = comma){
x <- enquo(x)
y <- enquo(y)
p <- df %>%
ggplot(aes(!! x, !! y)) +
geom_point(size = 0.6) +
geom_smooth(method = "loess", formula = y ~ x) +
labs(x = xlab, y = ylab, title = title)
if(!is.null(breaks)){
p <- p +
scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
}else{
p <- p +
scale_y_continuous(labels = labels, limits = limits)
}
p
}
trend.rating <- episodes %>%
scatter_plot(year, rating, ylab = "Rating score", title = "IMDb ratings",
breaks = seq(0, 10, 2), limits = c(1, 10))
trend.votes <- episodes %>%
scatter_plot(year, votes, ylab = "Number of votes", title = "IMDb votes",
limits = c(0, 4000))
trend.views <- episodes %>%
scatter_plot(year, us_views, ylab = "Number of US viewers", title = "TV views in the US",
limits = c(0, 35), labels = unit_format(unit = "", scale = 1e+6, big.mark = ","))
grid.arrange(trend.rating, trend.votes, trend.views, nrow = 1, ncol = 3, widths = c(0.95, 0.97, 1.08))
Seasons
line_plot <- function(data, x, y, xlab = "Season", ylab, title, sub,
limits = NULL, breaks = NULL, labels = comma){
x <- enquo(x)
y <- enquo(y)
p <- data %>%
ggplot(aes(!! x, !! y)) +
geom_line(size = 1.2, color = "#8d99ae") +
geom_point(shape=21, color=colors[1], fill=colors[1], size=1) +
scale_x_continuous(breaks = seq(1, 27, 4)) +
labs(x = xlab, y = ylab, title = title, subtitle = sub) +
theme(plot.subtitle=element_text(size=9))
if(!is.null(breaks)){
p <- p +
scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
}else{
p <- p +
scale_y_continuous(labels = labels, limits = limits)
}
p
}
episodes_byseason <- episodes %>%
group_by(season) %>%
summarize(avg_rate = mean(rating),
avg_vote = mean(votes),
avg_views = mean(us_views))
plot.rating <- episodes_byseason %>%
line_plot(season, avg_rate, ylab = "Rating score", title = "IMDb ratings",
sub = "Averaged by season \nOlder seasons are the most appreciated.",
breaks = seq(0, 10, 2), limits = c(1,10))
plot.vote <- episodes_byseason %>%
line_plot(season, avg_vote, ylab = "Number of votes", title = "IMDb votes",
sub = "Averaged by season \nOlder seasons are the most rated.",
limits = c(0, 2000))
plot.view <- episodes_byseason %>%
line_plot(season, avg_views, ylab = "Number of US viewers", title = "TV views in the US",
sub = "Averaged by season \nOlder seasons are the most viewed.",
labels = unit_format(unit = "", scale = 1e+6, big.mark = ","),
limits = c(0, 30))
grid.arrange(plot.rating, plot.vote, plot.view, nrow = 1, widths = c(0.95, 0.97, 1.08))
We notice an overall downward trend for all three indicators. Older seasons seem to be the most appreciated, most rated, and most viewed. As a matter of fact, The Simpsons show received acclaim throughout its first nine or ten seasons, which are generally considered its “Golden Age”, but has been criticized for a perceived decline in quality over the years.
To be fair, the availability in recent years of a variety of network channels and Internet streaming platforms has caused a systematic drop in the number of TV views not only for The Simpsons but also for many other shows.
Let us look at the pairwise relationships between IMDb ratings, IMDb votes, and TV views. There seem to be positive relationships between each pair, suggesting that when one of those indicators increases, the other two tend to increase as well.
line_smooth <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point(...) +
geom_smooth(method=method)
p
}
bi_density_plot <- function(data, mapping, palette = 4, ...){
p <- ggplot(data, mapping = mapping) +
stat_density_2d(aes(fill = stat(level)), geom = "polygon") +
scale_fill_distiller(palette = palette, direction = 1)
p
}
episodes %>%
select("IMDb rating" = rating, "IMDb votes" = votes, "TV views in US (millions)" = us_views) %>%
ggpairs(upper = list(continuous = wrap(bi_density_plot, palette = 1, size = 0.2)),
diag = list(continuous = wrap("barDiag", fill = "#8d99ae", bins = 27)),
lower = list(continuous = wrap(line_smooth, color = "black", size = 0.2))) +
ggtitle(label = "Pairwise variable comparisons")
The correlation matrix of episodes shows strong inverse relationships between the episode number and the ratings, the votes, and the views. This indicates that more recent episodes generally have lower values of those performance indicators.
episodes %>%
select("Episode number" = episode_id, "IMDb rating" = rating, "IMDb votes" = votes, "TV views in US" = us_views) %>%
cor() %>%
ggcorrplot(type = "lower", colors = c("#6D9EC1", "white", "#E46726"), outline.col = "white",
legend.title = "Correlation", lab = TRUE, ggtheme = ggplot2::theme_light()) +
labs(title = "Correlation matrix")
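As a small illustration of how such a correlation matrix is read, the sketch below builds a made-up data set in which ratings and views drift downward with the episode number. All numbers are simulated, and the comparison with Spearman's rank correlation is an addition for illustration; the analysis above uses the default of `cor()`, which is Pearson's.

```r
# Simulated episode-level data (hypothetical values, not the real data set)
set.seed(42)
toy_episodes <- data.frame(episode_id = 1:100)
toy_episodes$rating   <- 8.5 - 0.02 * toy_episodes$episode_id + rnorm(100, sd = 0.3)
toy_episodes$us_views <- 25 - 0.15 * toy_episodes$episode_id + rnorm(100, sd = 2)

# Pearson measures linear association; Spearman only assumes a monotone trend
pearson  <- cor(toy_episodes, method = "pearson")
spearman <- cor(toy_episodes, method = "spearman")

round(pearson, 2)
round(spearman, 2)
```

Negative entries in the `episode_id` row correspond to indicators that decrease as the series goes on, while the positive entry between `rating` and `us_views` reflects that the two decline together.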
Guest stars
In addition to the show’s regular cast of voice actors, celebrity guest stars have been a staple of The Simpsons since its first season. Guest voices have come from a wide range of professions, including actors, athletes, authors, musicians, artists, politicians, and scientists.
The guests data set contains information about the guest stars that took part in every episode and their role. Let us investigate the most recurring guest voices and roles.
Frequent guest stars
The most frequent guest stars across 30 seasons have been Marcia Wallace, Phil Hartman, and Maurice LaMarche. Whereas Marcia Wallace has almost always played the teacher Edna, the other two stars have actually voiced numerous characters on the show.
guests %>%
group_by(guest_star) %>%
summarize(unique_roles = paste(unique(role), collapse = '; '),
count = n()) %>%
arrange(desc(count)) %>%
show_table(head = 5, caption = "Top 5 guest stars and the roles they played",
col.names = c("Guest star", "Roles they played", "Number of appearances"),
align = c("l", "l", "r"))
| Guest star | Roles they played | Number of appearances |
|---|---|---|
| Marcia Wallace | Edna Krabappel-Flanders; Ms. Melon; Mrs. Krabapatra | 175 |
| Phil Hartman | Lionel Hutz; Heaven; Troy McClure; Moses; Cable guy; Plato; Joey; Godfather; Horst; Stockbroker; Smooth Jimmy Apollo; Lyle Lanley; Security Guard; Mandy Patinkin; Tom; Eddie Muntz; Evan Conover; Charlton Heston; Fat Tony; Hospital chairman; Bill Clinton | 73 |
| Maurice LaMarche | George C. Scott; Hannibal Lecter; Captain James T. Kirk; Eudora Welty; Commander McBragg; Orson Welles; Recruiter #2; Cap’n Crunch; First Mate Billy; Oceanographer; Farmer; Horn Stuffer; Fox announcer; Government Official; Jock; Toucan Sam; Trix Rabbit; Dwight D. Eisenhower; City Inspector; Nuclear Power Plant Guard; David Starsky; Anthony Hopkins; Charlie Sheen; Prepper; Chef Naziwa; Karl Malden; John Kerry; Milo; Football Commentator; Clive Meriwether; Neil Simon; Rodney Dangerfield; Morbo; Hedonismbot; Lrrr | 38 |
| Joe Mantegna | Fat Tony; Himself playing Fat Tony; Fit Tony | 30 |
| Jon Lovitz | Artie Ziff; Mr. Seckofsky; Professor Lombardo; Aristotle Amadopolis; Mr. Devaro; Llewellyn Sinclair; Ms. Sinclair; Jay Sherman; Llewelyn Sinclair; Aristotle Amadopoulis; Enrico Irritazio; Cigarette; Himself; Hacky; Snitchy the Weasel; Rabbi | 28 |
Frequent roles
The most frequent roles played by guest stars are either themselves or some supporting characters, such as the teacher Edna, the gangster Fat Tony, and the actor Troy McClure.
guests %>%
frequency_table(role, head = 10, prop = FALSE, caption = "Top 10 guest star roles", align = c("l", "r"))
| Role | Frequency |
|---|---|
| Himself | 336 |
| Edna Krabappel-Flanders | 173 |
| Herself | 59 |
| Fat Tony | 29 |
| Troy McClure | 29 |
| Lionel Hutz | 25 |
| Sideshow Bob | 21 |
| Themselves | 19 |
| Rabbi Hyman Krustofsky | 11 |
| Mona Simpson | 9 |
Who are the guest stars who played themselves in multiple episodes? At the top, we find the physicist and cosmologist Stephen Hawking with 4 appearances across 30 seasons, followed by the comic-book writer Stan Lee, the filmmaker Ken Burns, and the actor Gary Coleman, all of them with 3 appearances. The gender imbalance in the original characters is also reflected in the guest appearances, with just 5 women playing themselves twice in 30 seasons. The majority of the guest stars, however, appear in just a single episode.
guests %>%
filter(self) %>%
count(guest_star, sort = TRUE) %>%
filter(n > 1) %>%
add_column(gender = c(rep("Male", 7), "Female", rep("Male", 7), rep("Female", 2), "Male",
"Female", rep("Male", 14), rep("Female", 1), rep("Male", 2))) %>%
lollipop(x = guest_star, fill.var = gender, ylab = "Number of guest appearances in 30 seasons",
count = FALSE, title = "Who has played themselves in multiple Simpsons episodes?")
It would be interesting to see whether the number of guest appearances has changed over time. In the show’s early years, most guest stars voiced original characters, but as the show continued, the number of those appearing as themselves increased, especially throughout seasons 12 to 19. In more recent seasons, the gap between the two groups seems to have become more pronounced.
guests %>%
mutate(self = factor(self, levels = c(FALSE, TRUE), labels=c("Playing an original character", "Playing themselves"))) %>%
group_by(season, self) %>%
summarize(n = n()) %>%
ggplot(aes(season, n, color = self)) +
geom_line(size = 1.2) +
scale_color_manual(name = "Guest star", values = colors[3:4]) +
scale_x_continuous(breaks = seq(1, 30, 2)) +
theme(legend.position = "bottom") +
labs(x = "Season", y = "Number of guest appearances",
title = "The number of guest appearances over seasons")
When guest stars are voicing a character, how long do they talk? Let us explore the average number of lines reserved for guest stars. The guest stars with the most lines per episode are the ones voicing a narrator or an announcer and are usually not playing themselves. The only guest star playing themselves with a fair number of lines is Lady Gaga. Note that lines are matched to guests by role name only, so a generic role such as ‘Narrator’ or ‘Announcer’ pools the lines of every actor who voiced it, which inflates the per-episode figures for those roles.
guests <- guests %>%
mutate(role = ifelse(self, guest_star, role))
guests_summarized <- guests %>%
filter(season <= 27) %>%
group_by(guest_star, role, self) %>%
summarize(nb_episodes = n())
guest_roles <- guests_summarized %>%
inner_join(dialogues %>%
count(role, sort = TRUE, name = "nb_lines"),
by = "role") %>%
mutate(lines_per_episode = nb_lines/ nb_episodes)
guest_roles %>%
arrange(desc(lines_per_episode)) %>%
show_table(head = 15, align = c("l", "l", rep("r", 4)),
col.names = c("Guest star", "Role", "Playing themselves", "Number of episodes",
"Number of lines", "Number of lines per episode"),
caption = "Top 15 guest stars with the highest number of lines per episode")
| Guest star | Role | Playing themselves | Number of episodes | Number of lines | Number of lines per episode |
|---|---|---|---|---|---|
| Larry McKay | Announcer | FALSE | 1 | 386 | 386 |
| Matt Groening | Announcer | FALSE | 1 | 386 | 386 |
| Phil Hartman | Fat Tony | FALSE | 1 | 276 | 276 |
| Clarence Clemons | Narrator | FALSE | 1 | 156 | 156 |
| Daniel Stern | Narrator | FALSE | 1 | 156 | 156 |
| George Fenneman | Narrator | FALSE | 1 | 156 | 156 |
| Jim Forbes | Narrator | FALSE | 1 | 156 | 156 |
| Ken Burns | Narrator | FALSE | 1 | 156 | 156 |
| Marc Wilmore | Narrator | FALSE | 1 | 156 | 156 |
| Matt Dillon | Louie | FALSE | 1 | 104 | 104 |
| Greg Berg | Eddie | FALSE | 1 | 96 | 96 |
| James Earl Jones | Narrator | FALSE | 2 | 156 | 78 |
| Lady Gaga | Lady Gaga | TRUE | 1 | 78 | 78 |
| Kristen Wiig | Annie Crawford | FALSE | 1 | 74 | 74 |
| Steve Carell | Dan Gillick | FALSE | 1 | 64 | 64 |
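Several rows in the table share exactly the same number of lines (e.g., every ‘Narrator’ has 156). The sketch below, on hypothetical data, shows why this happens when guests are joined to the script by role name alone: every actor credited with a given role receives the total number of lines spoken under that role name. The actor names and counts are made up for illustration.

```r
library(dplyr)

# Hypothetical guest list: two different actors are both credited as "Narrator"
toy_guests <- tibble(
  guest_star  = c("Actor A", "Actor B", "Actor C"),
  role        = c("Narrator", "Narrator", "Dan Gillick"),
  nb_episodes = c(1, 1, 1)
)

# Hypothetical script lines, identified only by the role that speaks them
toy_dialogues <- tibble(
  role = c(rep("Narrator", 5), rep("Dan Gillick", 2))
)

# Same join logic as above: line counts are attached by role name only
toy_guest_roles <- toy_guests %>%
  inner_join(count(toy_dialogues, role, name = "nb_lines"), by = "role") %>%
  mutate(lines_per_episode = nb_lines / nb_episodes)

toy_guest_roles
```

Both hypothetical narrators end up with all five ‘Narrator’ lines. Separating them would require an extra key shared by both tables, such as an episode identifier, if one is available.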
Guest stars playing themselves tend to have fewer lines than those playing an actual character on the show.
guest_roles %>%
mutate(self = ifelse(self, "Playing themselves", "Playing an original character")) %>%
ggplot(aes(lines_per_episode)) +
geom_histogram(aes(fill = self), binwidth = 2, center = 1, show.legend = FALSE) +
facet_wrap(~ self, ncol = 2) +
scale_fill_manual(values = colors[3:4]) +
labs(x = "Number of lines per episode", y = "Frequency",
subtitle = "Most guest stars, especially those playing themselves, have relatively few lines per episode")
Is there a difference in IMDb ratings, IMDb votes, and TV views between the episodes with guest stars playing themselves and those with guest stars playing an original character? Somewhat. Episodes with guests playing themselves have, on average, lower ratings, votes, and views than episodes with guests playing an original character. According to the two-sample Wilcoxon rank sum test, the differences between the two groups are statistically significant at the 5% level.
episodes_by_self <- episodes %>%
filter(!is.na(self)) %>%
group_by(self)
episodes_by_self %>%
summarize(avg_rating = mean(rating),
avg_votes = mean(votes),
avg_views = mean(us_views)) %>%
mutate(self = ifelse(self, "Playing themselves", "Playing an original character")) %>%
add_row(self = "p-value",
avg_rating = wilcox.test(rating ~ self, data = episodes_by_self, exact = FALSE)$p.value,
avg_votes = wilcox.test(votes ~ self, data = episodes_by_self, exact = FALSE)$p.value,
avg_views = wilcox.test(us_views ~ self, data = episodes_by_self, exact = FALSE)$p.value) %>%
show_table(col.names = c("Guest star", "IMDb rating", "IMDb votes", "TV views in US (millions)"),
align = c("l", rep("r", 3)), digits = 3,
caption = "Average performance indicators and p-values of Wilcoxon rank sum test")
| Guest star | IMDb rating | IMDb votes | TV views in US (millions) |
|---|---|---|---|
| Playing an original character | 7.384 | 836.552 | 11.763 |
| Playing themselves | 7.275 | 777.821 | 11.216 |
| p-value | 0.007 | 0.037 | 0.003 |
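For readers unfamiliar with the test, here is a minimal sketch of a two-sample Wilcoxon rank sum test on simulated ratings. The data, group sizes, and effect size are made up; only the call pattern `wilcox.test(y ~ group, ...)` mirrors the code above.

```r
set.seed(1)

# Simulated ratings for two hypothetical groups of episodes
toy_ratings <- data.frame(
  rating = c(rnorm(40, mean = 7.4, sd = 0.5), rnorm(40, mean = 7.0, sd = 0.5)),
  self   = rep(c(FALSE, TRUE), each = 40)
)

# The test compares the ranks of the two samples rather than their raw values
toy_test <- wilcox.test(rating ~ self, data = toy_ratings, exact = FALSE)
toy_test$p.value

# Group medians, a natural companion summary for a rank-based test
toy_medians <- tapply(toy_ratings$rating, toy_ratings$self, median)
toy_medians
```

Because the test works on ranks, it does not require the indicators to be normally distributed, which is convenient for skewed quantities such as vote counts.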
plot_violin <- function(df, x, y, ylab, title = "", limits = NULL,
breaks = NULL, labels = comma){
x <- enquo(x)
y <- enquo(y)
data_summary <- function(x) {
m <- mean(x)
ymin <- m-sd(x)
ymax <- m+sd(x)
return(c(y=m,ymin=ymin,ymax=ymax))
}
p <- df %>%
filter(!is.na(!! x)) %>%
group_by(!! x) %>%
ggplot(aes(!! x, !! y, fill = !! x)) +
geom_violin() +
scale_fill_manual(name = "Guest star", values = colors[3:4],
labels = c("FALSE" = "Playing an original character", "TRUE" = "Playing themselves")) +
scale_x_discrete(labels = c("FALSE" = "", "TRUE" = "")) +
labs(x = "", y = ylab, title = title) +
stat_summary(fun.data = data_summary, geom = "pointrange", color = "black",
show.legend = FALSE)
if(!is.null(breaks)){
p <- p +
scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
}else{
p <- p +
scale_y_continuous(labels = labels, limits = limits)
}
p
}
plot.guest.ratings <- episodes %>%
plot_violin(x = self, y = rating, ylab = "Rating score", title = "IMDb ratings",
limits = c(1, 10), breaks = seq(0, 10, 2))
plot.guest.votes <- episodes %>%
plot_violin(x = self, y = votes, ylab = "Number of votes", title = "IMDb votes",
labels = comma, limits = c(0, 4000))
plot.guest.views <- episodes %>%
plot_violin(x = self, y = us_views, ylab = "Number of US viewers", title = "TV views in the US",
labels = unit_format(unit = "", scale = 1e+6, big.mark = ","),
limits = c(0, 35))
ggarrange(plot.guest.ratings, plot.guest.votes, plot.guest.views, nrow = 1, common.legend = TRUE,
legend="bottom", widths = c(0.95, 0.97, 1.08))
Text analysis
Let us now carry out some text analysis on dialogues. In this scenario, the dialogues of each character are acting as the documents of the corpus.
Word frequency
Frequent words
Let us have a look at the most frequent words. By keeping only distinct combinations of role, word, and line number, we avoid counting the same word more than once within a line. The most recurrent words, after removing the stop words, seem to be related to the characters addressing each other.
dialogues.tidy <- dialogues.tidy %>%
anti_join(stop_words, by = "word")
dialogues.summarized <- dialogues.tidy %>%
distinct(role, line_number, word, gender) %>%
count(role, word, gender, sort = TRUE)
dialogues.summarized %>%
show_table(head = 15, col.names = c("Character", "Word", "Gender", "Frequency"),
align = c("l", "r", "r"), caption = "Top 15 most frequent words")
| Character | Word | Gender | Frequency |
|---|---|---|---|
| Homer Simpson | marge | Male | 1,752 |
| Marge Simpson | homer | Female | 1,319 |
| Lisa Simpson | dad | Female | 1,076 |
| Homer Simpson | hey | Male | 933 |
| Bart Simpson | dad | Male | 876 |
| Lisa Simpson | bart | Female | 708 |
| Homer Simpson | gonna | Male | 694 |
| Homer Simpson | yeah | Male | 691 |
| Bart Simpson | hey | Male | 662 |
| Lisa Simpson | mom | Female | 612 |
| Homer Simpson | uh | Male | 607 |
| Homer Simpson | boy | Male | 583 |
| Marge Simpson | bart | Female | 570 |
| Homer Simpson | time | Male | 558 |
| Marge Simpson | homie | Female | 525 |
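The effect of calling `distinct()` before `count()` can be seen on a tiny hypothetical example, in which Homer says ‘marge’ three times in one line and once in another.

```r
library(dplyr)

# Hypothetical tidy dialogue: one row per spoken word
toy_tidy <- tibble(
  role        = c("Homer", "Homer", "Homer", "Homer"),
  line_number = c(1, 1, 1, 2),
  word        = c("marge", "marge", "marge", "marge")
)

# Raw count: every occurrence is counted
raw_count <- toy_tidy %>%
  count(role, word)

# Per-line count: a word is counted at most once per line
per_line_count <- toy_tidy %>%
  distinct(role, line_number, word) %>%
  count(role, word)

raw_count$n       # 4 occurrences in total
per_line_count$n  # 2 lines containing the word
```

The second figure is therefore the number of lines in which a character uses a word, not the number of times the word is uttered.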
Peculiar words
Let us compute the term frequency (tf), the inverse document frequency (idf), and the tf-idf. The latter looks for the most important words in each document that are not too common in other documents. In our case, this means finding the words that are peculiar to a particular character, but generally not to other characters.
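As a worked example on made-up counts, consider three characters who all say ‘time’ but each have one word of their own. In `bind_tf_idf()`, tf is the count of a word divided by the total count of its document, and idf is the natural logarithm of the number of documents divided by the number of documents containing the word. The characters and counts below are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts per character (one document per character)
toy_counts <- tibble(
  role = c("Nelson", "Nelson", "Smithers", "Smithers", "Homer", "Homer"),
  word = c("haw", "time", "sir", "time", "marge", "time"),
  n    = c(8, 2, 6, 4, 5, 5)
)

toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, role, n) %>%
  arrange(desc(tf_idf))

toy_tf_idf
```

The word ‘time’ appears in all three documents, so its idf is ln(3/3) = 0 and its tf-idf vanishes for every character. ‘haw’ is used by Nelson only, which gives tf = 8/10 and idf = ln(3/1), hence the largest tf-idf in the table.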
role.specificity <- dialogues.summarized %>%
group_by(role) %>%
mutate(total_words = sum(n)) %>%
ungroup() %>%
bind_tf_idf(word, role, n) %>%
arrange(desc(tf_idf))
We can use the tf-idf as a catchphrase detector. Specifically, we look at the characters with a fair amount of dialogue (i.e., at least 500 words), and keep one row for each character (to find one peculiar word for every role). For some characters, the peculiar word is the name or title of the character they usually talk to (e.g., Smithers saying ‘sir’ or Agnes Skinner saying ‘Seymour’). For others, it is either the word they use to introduce themselves (e.g., Troy McClure saying ‘I’m Troy McClure’) or a recurring sound (e.g., the Captain going ‘Arrr’ or Nelson ‘haw’).
role.specificity %>%
filter(total_words >= 500) %>%
distinct(role, .keep_all = TRUE) %>%
mutate(role_word = paste0(role, ": ", word)) %>%
head(20) %>%
mutate(role_word = fct_reorder(role_word, tf_idf)) %>%
rename(freq = n, n = "tf_idf") %>%
lollipop(x = role_word, fill.var = gender, count = FALSE, ylab = "TF-IDF", pos = "right",
title = "Using TF-IDF as a catchphrase detector", labels = FALSE,
sub = "Top 20 characters speaking at least 500 words in 27 seasons.")
Bigrams Analysis
Let us now focus on the bigrams, that is, the pairs of words that often occur together.
Frequent bigrams
The most recurrent bigrams concern the members of the Simpson family (e.g., ‘homer simpson’, ‘bart simpson’, ‘lisa simpson’) or some onomatopoeia (e.g., ‘woo hoo’, ‘hey hey’, ‘la la’).
dialogue_bigram <- dialogues %>%
unnest_tokens(bigram, line, token = "ngrams", n = 2)
dialogue_filtered <- dialogue_bigram %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word & !is.na(word1)) %>%
filter(!word2 %in% stop_words$word & !is.na(word2))
bigram_counts <- dialogue_filtered %>%
count(word1, word2, sort = TRUE)
bigram_united <- dialogue_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigram_united %>%
count(bigram, sort = TRUE) %>%
show_table(head = 10, col.names = c("Bigram", "Frequency"), align = c("l", "r"),
caption = "Top 10 bigrams throughout 26 seasons")
| Bigram | Frequency |
|---|---|
| homer simpson | 461 |
| woo hoo | 360 |
| hey hey | 311 |
| la la | 268 |
| bart simpson | 258 |
| heh heh | 221 |
| ha ha | 215 |
| uh huh | 210 |
| haw haw | 184 |
| lisa simpson | 174 |
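To see what the bigram pipeline does step by step, here is a minimal sketch on two invented lines of dialogue. The sentences are hypothetical; the functions are the same ones used above.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Two invented lines of dialogue
toy_lines <- tibble(
  role = c("Homer", "Nelson"),
  line = c("Woo hoo! I am Homer Simpson", "Haw haw! Look at Homer Simpson")
)

toy_bigrams <- toy_lines %>%
  # Lowercase, strip punctuation, and slide a window of two words over each line
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Drop any bigram in which either word is a stop word
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

toy_bigrams
```

Bigrams such as ‘i am’ or ‘at homer’ disappear because they contain a stop word, whereas ‘homer simpson’ survives and is counted once per line.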
Peculiar bigrams
The peculiar bigrams are those with the largest tf-idf among the bigrams that occur over 50 times. At the top, we find Nelson’s signature mocking laugh ‘haw haw’, and ‘kent brockman’, as the TV announcer usually starts off with ‘This is Kent Brockman’.
bigram_tf_idf <- bigram_united %>%
count(role, bigram) %>%
bind_tf_idf(bigram, role, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf %>%
filter(n > 50) %>%
distinct(role, .keep_all = TRUE) %>%
show_table(col.names = c("Character", "Bigram", "Frequency", "tf", "idf", "tf-idf"), align = c("l", rep("r", 5)),
caption = "Peculiar bigrams occurring over 50 times throughout 26 seasons")
| Character | Bigram | Frequency | tf | idf | tf-idf |
|---|---|---|---|---|---|
| Nelson Muntz | haw haw | 133 | 0.13 | 4.83 | 0.62 |
| Kent Brockman | kent brockman | 70 | 0.02 | 5.45 | 0.13 |
| Krusty the Clown | hey hey | 80 | 0.03 | 4.09 | 0.13 |
| Moe Szyslak | hey hey | 57 | 0.02 | 4.09 | 0.06 |
| Homer Simpson | woo hoo | 311 | 0.01 | 5.15 | 0.06 |
| Bart Simpson | hey dad | 55 | 0.00 | 6.50 | 0.03 |
Networks
The relationships across the bigrams can be depicted through a network plot. To keep the plot readable, we consider bigrams that occurred more than 30 times. Each arrow runs from the first word of a bigram to the second, and the nodes at which most of the arrows arrive seem to be ‘simpson’, ‘dollars’, and ‘hoo’. All in all, the most common bigrams seem to refer to character names, locations, or onomatopoeia.
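The direction of the arrows can be checked numerically. In the hypothetical sketch below, `graph_from_data_frame()` treats the first column as the source and the second as the target of each edge, so a word that ends many bigrams has a high in-degree. The bigrams and counts are invented for illustration.

```r
library(igraph)

# Invented bigram counts: word1 is the edge source, word2 the edge target
toy_bigram_counts <- data.frame(
  word1 = c("homer", "bart", "lisa", "simpson", "simpson", "woo"),
  word2 = c("simpson", "simpson", "simpson", "family", "home", "hoo"),
  n     = c(12, 9, 7, 4, 3, 10)
)

toy_graph <- graph_from_data_frame(toy_bigram_counts)

# Number of arrows arriving at, and departing from, each word
in_deg  <- sort(degree(toy_graph, mode = "in"), decreasing = TRUE)
out_deg <- sort(degree(toy_graph, mode = "out"), decreasing = TRUE)

in_deg
out_deg
```

Here ‘simpson’ receives three arrows (from ‘homer’, ‘bart’, and ‘lisa’) and sends out two, which is the kind of hub that stands out in the network plot.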
bigram.graph <- bigram_counts %>%
filter(n > 30) %>%
graph_from_data_frame()
set.seed(1234)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram.graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a, end_cap = circle(.07, "inches")) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Sentiment Analysis
Let us carry out a sentiment analysis to explore the feelings that emerge from The Simpsons dialogues. We are using the ‘bing’ lexicon, which attributes a positive or negative valence to every word in its vocabulary.
Which words contribute the most to the positive and negative sentiments?
Jointly
To get more insightful results, we only consider the words occurring more than 400 times. Among the positive words, we find, e.g., ‘love’, ‘wow’, ‘nice’, and ‘fine’, whereas among the negative ones we find ‘bad’, ‘burns’, ‘stupid’, and ‘kill’. The word ‘burns’ is associated with a negative sentiment because it is seen as coming from the verb ‘to burn’, which clearly has a negative connotation. In the Simpsons’ case, though, the word ‘burns’ is likely to just refer to the character called Mr. Burns. However, due to the evil and greedy nature of the character himself, the graph seems pretty accurate after all!
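The signed "contribution to sentiment" can be illustrated on a handful of made-up tokens. The word list below is hypothetical; the lexicon lookup uses `get_sentiments("bing")` as in the analysis, and words missing from the lexicon simply drop out of the inner join.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, one row per spoken word
toy_tokens <- tibble(word = c("love", "love", "nice", "bad", "stupid", "donut"))

toy_sentiment <- toy_tokens %>%
  # Keep only the words that the lexicon labels as positive or negative
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment) %>%
  # Negative words contribute with a minus sign
  mutate(contribution = ifelse(sentiment == "negative", -n, n)) %>%
  arrange(desc(abs(contribution)))

toy_sentiment
```

A lexicon of this kind scores each word in isolation, which is why a proper name that coincides with an ordinary word can end up with a sentiment label.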
simpsons_sentiments <- dialogues.tidy %>%
inner_join(get_sentiments("bing"), by = "word")
bar_chart_sentiment <- function(df, x = NULL, y, z, slice = NULL, count = TRUE){
y <- enquo(y)
z <- enquo(z)
if(count){
x <- enquo(x)
df <- df %>%
count(!! x, !! y, !! z) %>%
mutate(n = ifelse(!! z == "negative", -n, n),
!! x := reorder(!! x, -abs(n), sum),
!! y := reorder_within(!! y, n, !! x)) %>%
group_by(!! x) %>%
arrange(desc(abs(n)))
}
if(!is.null(slice)){
df <- df %>%
slice(slice)
}
df %>%
ggplot(aes(!! y, n, fill = !! z)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(name = "Sentiment", values = c("#f1776a", "#8bc384")) +
labs(x = "Word", y = "Contribution to sentiment") +
scale_x_reordered() +
theme(legend.position = "bottom")
}
simpsons_sentiments %>%
count(sentiment, word) %>%
ungroup() %>%
filter(n > 400) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
bar_chart_sentiment(y = word, z = sentiment, count = FALSE)
Per character
The words most responsible for the positive (e.g., ‘love’, ‘nice’, ‘fine’) and negative sentiments (e.g., ‘bad’, ‘wrong’) do not seem to depend that much on the character.
simpsons_sentiments %>%
filter(role %in% Top20Chars[1:6]) %>%
bar_chart_sentiment(x = role, y = word, z = sentiment, slice = c(1:12)) +
facet_wrap(~ role, scales = "free")
Word clouds
Let us now have a look at some word clouds, which give us an idea of the most recurrent words throughout the seasons. To get more insightful results, we remove from dialogues.tidy some onomatopoeia.
Jointly
plot_wordcloud <- function(df, x, y, head = TRUE, head.size = 50,
shape = "diamond", max.size = 20){
x <- enquo(x)
y <- enquo(y)
if(head){
df <- df %>%
head(head.size)
}
df %>%
ggplot(aes(label = !! x, size = !! y,
color = factor(sample.int(10, length(!! x), replace = TRUE)))) +
# shape is an argument of geom_text_wordcloud(), not an aesthetic
geom_text_wordcloud(rm_outside = TRUE, shape = shape) +
scale_size_area(max_size = max.size)
}
custom_stopwords <- stop_words %>%
add_row(word = c("hey", "gonna", "yeah", "uh", "ya", "ho", "la", "em",
"ah", "huh", "ooh", "gotta", "eh", "aw", "heh", "wow",
"ow", "haw", "woo", "ha", "wanna", "whoa", "hoo", "ye", "wait"))
dialogues.tidy <- dialogues.tidy %>%
anti_join(custom_stopwords, by = "word")
set.seed(1234)
dialogues.tidy %>%
count(word, sort = TRUE) %>%
plot_wordcloud(x = word, y = n) +
theme_minimal()
Per character
Let us depict a separate word cloud for the most talkative characters of the show, to get an insight into their most recurrent ‘themes’. The most recurrent words for each character are the ones used for interacting with the other characters on the show. As we would expect, we discover a broad theme revolving around school and friends for Bart and Lisa Simpson, around the bar for Moe, and the nuclear plant for Mr. Burns.
dialogues.tidy %>%
filter(role %in% Top20Chars[1:6]) %>%
count(role, word, sort = TRUE) %>%
group_by(role) %>%
arrange(desc(n)) %>%
slice(1:20) %>%
distinct(word, .keep_all = TRUE) %>%
mutate(prop = n / sum(n)) %>%
ungroup() %>%
mutate(role = reorder(role, -n, sum)) %>%
plot_wordcloud(x = word, y = abs(prop), head = FALSE, shape = "circle", max.size = 5) +
facet_wrap(~role)
Topic Modelling
We conclude this analysis with some topic modelling. To this end, we need to construct the document term matrix of the dialogues. It turns out that considering the show episodes as documents does not provide much insight, as each episode touches on various topics. Therefore, we treat all the lines spoken by a given character as one document.
To get more meaningful results, when constructing the document term matrix, we only keep the words occurring at least 10 times, and whose tf-idf is higher than the 70% quantile.
dialogues.tidy.lda <- dialogues.tidy %>%
select(role, word) %>%
na.omit()
y <- document_term_frequencies(dialogues.tidy.lda)
dtm.y <- document_term_matrix(y)
dtm.y <- dtm_remove_lowfreq(dtm.y, minfreq = 10)
y <- dtm_remove_tfidf(dtm.y, prob = 0.7)
Let us perform a Latent Dirichlet Allocation (LDA) analysis on the document term matrix. We allow a large number of hidden topics, say eight.
The plots below show the 10 most representative words for each topic, and the four characters with the highest probabilities of belonging to each topic. Inspecting them will allow us to give a meaning to the topics, and find the underlying traits shared by groups of characters.
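The fitting step itself is not shown in this write-up. As a rough sketch of what it could look like, the example below fits an LDA model with `topicmodels::LDA()` on a tiny invented corpus, built with the same udpipe helpers used above. The toy words, the number of topics, the default estimation method, and the seed are all assumptions for illustration; in the actual pipeline the analogous call would presumably take the filtered matrix `y` and `k = 8`.

```r
library(udpipe)
library(topicmodels)
library(tidytext)

# Invented (role, word) pairs standing in for dialogues.tidy.lda
toy_tokens <- data.frame(
  role = rep(c("Moe", "Barney", "Skinner", "Edna"), each = 6),
  word = c("beer", "bar", "beer", "money", "bar", "beer",
           "beer", "bar", "beer", "beer", "money", "bar",
           "school", "children", "principal", "school", "children", "school",
           "school", "children", "class", "school", "class", "children"),
  stringsAsFactors = FALSE
)

# One row per character, one column per word, cells holding word counts
toy_dtm <- document_term_matrix(document_term_frequencies(toy_tokens))

# Fit the model; the seed is set only to make this sketch reproducible
n_topics_toy <- 2
toy_lda <- LDA(toy_dtm, k = n_topics_toy, control = list(seed = 1234))

# beta: probability of each word within a topic
toy_beta <- tidy(toy_lda, matrix = "beta")
# gamma: probability of each topic within a character's dialogue
toy_gamma <- tidy(toy_lda, matrix = "gamma")

toy_beta
toy_gamma
```

The two tidy tables have the same structure as `simpsons_topics` and `simpsons_gamma` below: the betas sum to one within each topic, and the gammas sum to one within each character.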
simpsons_topics <- simpsons_lda %>%
tidy(matrix = "beta")
simpsons_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder_within(term, beta, topic),
topic = factor(topic, levels = c(1:n_topics),
labels = paste("Topic", 1:n_topics))) %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free", nrow = 2) +
coord_flip() +
scale_fill_manual(values = palette2) +
scale_x_reordered() +
labs(x = "Term", y = "Per-topic-per-word probability",
title = "Word-Topic probabilities", subtitle = "Top 10 words for every topic")
simpsons_gamma <- simpsons_lda %>%
tidy(matrix = "gamma")
simpsons_classification <- simpsons_gamma %>%
group_by(document) %>%
top_n(1, gamma) %>%
ungroup()
top_chars_by_topic <- simpsons_classification %>%
group_by(topic) %>%
top_n(4, gamma) %>%
arrange(topic, -gamma) %>%
ungroup() %>%
select(document)
simpsons_gamma %>%
inner_join(top_chars_by_topic, by = c("document")) %>%
mutate(document = factor(document, levels = rev(top_chars_by_topic$document))) %>%
ggplot(aes(document, topic, fill = gamma)) +
geom_tile() +
coord_flip() +
scale_y_continuous(breaks = seq(1:n_topics)) +
scale_fill_gradient(name = "Per-character-per-topic probability", low="white", high="blue") +
labs(x = "", y = "Topic", title = "Character-Topic probabilities",
subtitle = "Top 4 characters for every topic") +
theme(legend.position = "bottom")
| Topics | Interpretation |
|---|---|
| 1 | This topic seems to capture the social life of Homer Simpson, as indicated by the words ‘moe’, ‘beer’, and ‘money’. Among the characters classified to this topic, we find the bar owner Moe, and Homer’s friends Lenny, Barney, and Carl. Ned Flanders is also allocated here, which explains the presence of words like ‘god’, ‘love’, and ‘lord’. |
| 2 | The words ‘sir’ and ‘chief’ seem to suggest employer-employee relationships. The top characters allocated to this topic include Smithers, Mr. Burns, Lou, Eddie, and Chief Wiggum. Smithers is Burns’ trusted personal assistant, whereas Lou and Eddie are the two police officers who aid Chief Wiggum on almost every mission. |
| 3 | It seems to revolve around school, as demonstrated by the words ‘children’, ‘willie’, ‘edna’, and ‘principal’. As we could expect, the top characters for this topic are all connected to Springfield Elementary School. They include the principal Skinner, the teachers Edna Krabappel-Flanders and Miss Hoover, the superintendent Chalmers, and the weird student Ralph Wiggum. |
| 4 | The meaningful words of this topic (‘homie’, ‘honey’, and ‘husband’) touch on family and marital aspects. Interestingly enough, we find not only Marge, but also Mona Simpson (Homer’s mother), Selma (Marge’s sister), and Manjula (Apu’s wife). |
| 5 | The words ‘lisa’ and ‘milhouse’ possibly hint at kid-related topics. The top characters include Lisa, Milhouse, young Carl, and young Lenny. |
| 6 | We read ‘marge’, ‘boy’, ‘bart’, ‘lisa’, ‘stupid’, and ‘flanders’, and we can’t help thinking about Homer’s world! Homer is, of course, the main character of this topic, followed by other secondary characters. |
| 7 | We can be confident that the words ‘cool’, ‘krusty’, ‘boys’, and ‘milhouse’ revolve around the social life of Bart Simpson. Besides Bart, the other main characters are his bully friends Nelson, Jimbo, and Kearney, the recidivist criminal Snake Jailbird, and the drug-addicted bus driver Otto Mann. What a crew! |
| 8 | The words ‘live’, ‘story’, and ‘coming’ allude to the theme of news and TV coverage. Local news often reports about Krusty the Clown, which is why the top two words refer to him. The main characters belonging to this topic are the narrator, some announcers, the news reporter Kent Brockman, and the sea captain Horatio McCallister. |
Conclusion
This analysis gave us some pretty interesting insights on The Simpsons show! We discovered the most popular characters, and the locations where they usually interact. We then analyzed the ratings, votes, and views of the episodes across 27 seasons. We also found the most recurring guest stars, and evaluated whether their presence had any impact on the number of lines or ratings. Thanks to the scripts’ availability, we carried out some text analysis that, among other things, allowed us to inspect the underlying sentiments and the main topics of the show.